18 research outputs found

    Using an Adaptive HPC Runtime System to Reconfigure the Cache Hierarchy

    The cache hierarchy often consumes a large portion of a processor’s energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations associated with other methods that reconfigure the caches: predicting the application’s future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application’s pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to find the best configuration in parallel quickly. Our experiments using cycle-level simulations indicate that 67% of the cache energy can be saved with only a 2.4% performance penalty on average. Moreover, we demonstrate that, for some applications, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30% and save 75% of the cache energy.
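
    The idea of using the SPMD model to search for a configuration in parallel can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: it assumes every rank runs the same iteration, so each rank can try a different cache configuration during one iteration and the results can then be compared. The names `spmd_config_search` and `toy_measure`, the "number of active ways" configuration, and the time and energy numbers are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def toy_measure(ways: int) -> Tuple[float, float]:
    """Made-up cost model (assumption): fewer active ways is slower but cheaper."""
    time = 1.0 + 0.4 / ways   # hypothetical iteration time
    energy = 0.1 * ways       # hypothetical cache energy per iteration
    return time, energy


def spmd_config_search(
    configs: List[int],
    num_ranks: int,
    measure: Callable[[int], Tuple[float, float]],
    baseline_time: float,
    max_slowdown: float,
) -> Tuple[int, int]:
    """Return (best configuration, number of trial iterations used).

    In each trial iteration, up to `num_ranks` configurations are tried at
    once, one per rank. The lowest-energy configuration whose time stays
    within the allowed slowdown is selected.
    """
    results: Dict[int, Tuple[float, float]] = {}
    rounds = 0
    for start in range(0, len(configs), num_ranks):
        batch = configs[start:start + num_ranks]
        # Conceptually these measurements happen concurrently, one per rank;
        # here they are evaluated in a plain loop.
        for cfg in batch:
            results[cfg] = measure(cfg)
        rounds += 1

    limit = baseline_time * (1.0 + max_slowdown)
    feasible = [cfg for cfg, (t, _) in results.items() if t <= limit]
    if not feasible:
        raise ValueError("no configuration meets the slowdown bound")
    best = min(feasible, key=lambda cfg: results[cfg][1])
    return best, rounds
```

    With more ranks than candidate configurations, the whole search fits in a single trial iteration, which is the reason the SPMD structure makes the search fast.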

    Parallelizing Julia with a Non-Invasive DSL

    Computational scientists often prototype software using productivity languages that offer high-level programming abstractions. When higher performance is needed, they are obliged to rewrite their code in a lower-level efficiency language. Different solutions have been proposed to address this trade-off between productivity and efficiency. One promising approach is to create embedded domain-specific languages that sacrifice generality for productivity and performance, but practical experience with DSLs points to some roadblocks preventing widespread adoption. This paper proposes a non-invasive domain-specific language that makes as few visible changes to the host programming model as possible. We present ParallelAccelerator, a library and compiler for high-level, high-performance scientific computing in Julia. ParallelAccelerator's programming model is aligned with existing Julia programming idioms. Our compiler exposes the implicit parallelism in high-level array-style programs and compiles them to fast, parallel native code. Programs can also run in "library-only" mode, letting users benefit from the full Julia environment and libraries. Our results show encouraging performance improvements with very few changes to source code required. In particular, few to no additional type annotations are necessary.

    Parallelizing Julia with a Non-Invasive DSL (Artifact)

    This artifact is based on ParallelAccelerator, an embedded domain-specific language (DSL) and compiler for speeding up compute-intensive Julia programs. In particular, Julia code that makes heavy use of aggregate array operations is a good candidate for speeding up with ParallelAccelerator. ParallelAccelerator is a non-invasive DSL that makes as few changes to the host programming model as possible

    Power, Reliability, Performance: One System to Rule Them All

    In a design based on the Charm++ parallel programming framework, an adaptive runtime system interacts dynamically with a data center's resource manager to control power through intelligent job scheduling, resource reassignment, and hardware reconfiguration. It simultaneously manages reliability by cooling the system to the level that is optimal for the running application, and it maintains performance through load balancing

    Energy-efficient computing for HPC workloads on heterogeneous manycore chips

    Power and energy efficiency are among the major challenges in achieving exascale computing in the next several years. While chips operating at low voltages have been found to be highly energy-efficient, low-voltage operation leads to heterogeneity across cores within the microprocessor chip. In this work, we study chips with low-voltage operation and discuss programming systems and performance modeling in the presence of heterogeneity. We propose an integer linear programming based approach for selecting the optimal configuration of a chip that minimizes its energy consumption. We obtain average savings of 26% and 10.7% in the energy consumption of the chip for two HPC mini-applications, miniMD and Jacobi, respectively. We also evaluate the energy savings with execution time constraints, using the proposed approach. These energy savings are significantly larger than the savings from sub-optimal configurations obtained with heuristics.
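
    To make the configuration-selection problem concrete, here is a minimal sketch under stated assumptions. It is not the paper's formulation: it assumes each core has a small list of hypothetical operating points, each with a (time, energy) pair for that core's share of the work, and it picks exactly one point per core so that total energy is minimal and no core exceeds a deadline. Instead of calling an ILP solver, the sketch enumerates every 0/1 selection by brute force, which is only practical for tiny instances. The function name and all numbers are made up for illustration.

```python
from itertools import product
from typing import List, Optional, Tuple

# (time, energy) of one core at one operating point; values are hypothetical.
OperatingPoint = Tuple[float, float]


def min_energy_configuration(
    cores: List[List[OperatingPoint]],
    deadline: float,
) -> Optional[Tuple[Tuple[int, ...], float]]:
    """Choose one operating point per core, minimizing total energy.

    Constraint: every core must finish within `deadline`.
    Returns (chosen option index per core, total energy), or None when no
    selection is feasible. Brute-force stand-in for an ILP solver.
    """
    best_choice: Optional[Tuple[int, ...]] = None
    best_energy = float("inf")

    # Enumerate every combination of one option index per core.
    for choice in product(*(range(len(options)) for options in cores)):
        finish_time = max(cores[i][j][0] for i, j in enumerate(choice))
        if finish_time > deadline:
            continue  # violates the execution time constraint
        energy = sum(cores[i][j][1] for i, j in enumerate(choice))
        if energy < best_energy:
            best_choice, best_energy = choice, energy

    if best_choice is None:
        return None
    return best_choice, best_energy


# Two heterogeneous cores: the same option index gives different time/energy.
example_cores = [
    [(1.0, 5.0), (1.5, 3.0), (2.0, 2.0)],
    [(1.2, 6.0), (1.6, 3.5), (2.4, 2.5)],
]
```

    Tightening the deadline forces faster, more energy-hungry operating points, and a deadline that no option of some core can meet leaves the problem infeasible.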

    Parallel Programming with Migratable Objects: Charm++ in Practice

    The advent of petascale computing has introduced new challenges (e.g., heterogeneity, system failure) for programming scalable parallel applications. Increased complexity and dynamism in science and engineering applications of today have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real-world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to the development of applications that scale irrespective of the rough landscape of supercomputing technology. The empirical evaluation presented in this paper spans many mini-applications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede.

    Simulation-based performance analysis and tuning for future supercomputers

    Hardware and software co-design is becoming increasingly important due to complexities in supercomputing architectures. Simulating applications before there is access to the real hardware can assist machine architects in making better design decisions that can optimize application performance. At the same time, the application and run-time can be optimized and tuned beforehand. BigSim is a simulation-based performance prediction framework designed for these purposes. It can be used to perform packet-level network simulations of parallel applications using existing parallel machines. In this thesis, we demonstrate the utility of BigSim in analyzing and optimizing parallel application performance for future systems based on the PERCS network. We present simulation studies using benchmarks and real applications expected to run on future supercomputers. Future peta-scale systems will have more than 100,000 cores, and we present simulations at that scale

    Power and energy management of modern architectures in adaptive HPC runtime systems

    Power and energy efficiency are important challenges for the High Performance Computing (HPC) community. Excessive power consumption is a main limitation for further scaling of HPC systems, and researchers believe that current technology trends will not provide Exascale performance within a reasonable power budget in the near future. Hardware innovations such as the proposed Exascale architectures and Near Threshold Computing are expected to improve power efficiency significantly, but more innovations are required in this domain to make Exascale possible. To help shrink the power efficiency gap, we argue that adaptive runtime systems can be exploited. The runtime system (RTS) can save significant power, since it is aware of both the hardware properties and the application behavior. We use application-centric analysis of different architectures to design automatic adaptive RTS techniques that save significant power in different system components, with only minor hardware support. In a nutshell, we analyze different modern architectures and common applications and illustrate that some system components, such as caches and network links, consume a disproportionately large share of power for common HPC applications. We demonstrate how a large fraction of the power consumed in caches and networks can be saved automatically using our approach. In these cases, the hardware support the RTS needs is the ability to turn off ways of set-associative caches and network links. We also present some required RTS techniques, such as using pattern recognition to identify the running application’s pattern, predict its future, and adapt the hardware appropriately. Furthermore, we address two types of prevalent heterogeneity: utilization of accelerator devices and process variation. To study accelerators, we analyze and optimize an example application on a heterogeneous architecture and demonstrate techniques for efficient mapping on different devices (CPU and GPU). To address process variation challenges, we develop accurate models that let the RTS schedule efficiently in the presence of speed and power consumption variation. Using the models, we develop a novel scheduling framework that uses integer linear programming to enforce different performance and power consumption constraints.